I’ll start by loading all the libraries needed for this homework.
The chosen dataset, Wine, contains information about the chemical composition of many wines, along with a quality rating from 0 to 10, where 0 is a bad wine and 10 is a perfect one. The dataset can be found on Kaggle. Some of the variables are fairly self-explanatory even without much wine knowledge; I’ll explain some of them when we get to the classification and regression techniques.
Let’s now read the dataset and take a quick look at it.
df <- read.csv("./WineQT.csv")
head(df)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.4 0.70 0.00 1.9 0.076
## 2 7.8 0.88 0.00 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.70 0.00 1.9 0.076
## 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality Id
## 1 5 0
## 2 5 1
## 3 5 2
## 4 6 3
## 5 5 4
## 6 5 5
We have 13 columns and 1143 observations of wines. Our target variable will be the quality of the wine.
Moving on to preprocessing, I will search for strange or missing values and deal with them. Some visualization will also be done in the conclusions.
Luckily, we don’t have any missing values:
na_counts <- colSums(is.na(df))
barplot(na_counts, las = 2, main = "NA's Location", col = "#1B9E77", ylim = c(0, 1))
grid()
We can make sure there aren’t any with the following line:
round(sum(is.na(df))/nrow(df)*100,2)
## [1] 0
So 0% of the dataset contains NAs, great.
NAs are a minor problem compared to outliers, which can harm our classification and, of course, our regression models. The following plot shows a boxplot for each variable in the dataset.
boxplot(df,las=2, col="#1B9E77", main="Data distribution")
grid()
Firstly, we notice that the variable Id can be omitted, as it doesn’t provide any useful information.
drop <- names(df) %in% c("Id")
df <- df[,!drop]
head(df)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.4 0.70 0.00 1.9 0.076
## 2 7.8 0.88 0.00 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.70 0.00 1.9 0.076
## 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
Looking at the plot again after dropping Id:
boxplot(df,las=2, col="#1B9E77", main="Data distribution")
grid()
We notice outliers in every variable, but after some research I found that, since the dataset contains a great variety of wines, each wine has characteristic values for each variable: for example, better wines tend to have a bit more alcohol, while poorer wines have more citric acid than the rest. So every value here is informative. Let’s focus on total.sulfur.dioxide, the variable with the widest range of values, which includes many outliers.
boxplot(df$total.sulfur.dioxide, col="#1B9E77", main="Total Sulfur Dioxide")
grid()
“It preserves wine’s freshness and fruit characters by virtue of antioxidant, antimicrobial and anti-enzymatic properties.” Read more about sulfur in wine here. This compound is added at different moments of wine production; depending on when and how much is added, the wine will turn out better or worse.
Let’s visualize how our output variable (quality) is distributed
ggplot(df) +
aes(x = quality) +
geom_density(adjust = 6L, fill = "#112446") +
labs(title = "Quality Distribution") +
theme(
plot.title = element_text(size = 15L,
face = "bold",
hjust = 0.5)
) +
xlim(2, 9)
Notice that we have a quite imbalanced dataset, with plenty of observations at qualities 5 and 6, and fewer the further we move from those values. Looking at the correlation between all the variables we get the following plot.
R = cor(df)
ggcorr(df, label = T, hjust = 0.99,size = 1.5)
We can see that some variables are correlated with each other, but most are not; we will see how this affects our linear models shortly.
Before we jump into classification techniques, let’s encode our output in order to re-balance our data. We will consider a wine “Good” if its quality score is 6 or higher and, conversely, “Poor” if its quality is 5 or lower.
We will recode quality as Good/Poor and create another variable with 1’s and 0’s to help (a little) with the regression techniques.
dfc <- df
dfc$quality <- ifelse(df$quality <= 5, "Poor", "Good")
dfr <- dfc
dfr$quality[dfc$quality == "Good"] <- 1
dfr$quality[dfc$quality == "Poor"] <- 0
dfc$quality <- factor(dfc$quality)
dfr$quality <- strtoi(dfr$quality)
head(df$quality)
## [1] 5 5 5 6 5 5
head(dfc$quality)
## [1] Poor Poor Poor Good Poor Poor
## Levels: Good Poor
head(dfr$quality)
## [1] 0 0 0 1 0 0
table(dfc$quality)
##
## Good Poor
## 621 522
Notice that our initially imbalanced dataset is now much more balanced.
Before any technique is carried out we must split our data into train
and test sets, with 80% of the dataset for training and the remainder for
testing. We will include set.seed(237) in almost every
code chunk in order to keep the same results even if we run the
chunks many times. Without it, every knit or re-run would let the random
parts of the functions R calls change our results.
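As a quick base-R check of what set.seed buys us (a toy example, unrelated to the wine data):

```r
# Resetting the seed replays the exact same random sequence
set.seed(237)
a <- sample(1:100, 5)
set.seed(237)
b <- sample(1:100, 5)
identical(a, b)  # TRUE: the draws are reproducible
```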
set.seed(237)
spl = createDataPartition(dfc$quality, p = 0.8, list = FALSE) # 80% for training
train = dfc[spl,]
test = dfc[-spl,]
spl2 = createDataPartition(dfr$quality, p = 0.8, list = FALSE) # 80% for training
train2 = dfr[spl2,]
test2 = dfr[-spl2,]
We’ve created two train sets and two test sets, one pair for classification and another for regression. Let’s look at how a random variable in the train set behaves, for example residual sugar.
ggplot(train) +
aes(x = residual.sugar, fill = quality) +
geom_histogram(bins = 30L) +
scale_fill_manual(
values = c(Good = "#7499FF",
Poor = "#FF0000")
) +
labs(
x = "Residual Sugar",
y = " ",
title = "Data distribution",
subtitle = "Train Set"
) +
theme_bw() +
theme(legend.position = "none")
We can see that Good and Poor wines are similarly distributed across values of residual sugar.
Several classifiers will be discussed in this part of the HW; at the very end, a best model will be selected.
To start with, the decision tree classifier is one of the most basic classifiers: it works by creating binary partitions of the data and splitting each branch again. In our case we will use cross-validation during training for better results.
set.seed(237)
tree <- train(quality~ .,
data = train,
method = "rpart",
control=rpart.control(minsplit = 5, maxdepth = 10,cp=0.00009),
trControl = trainControl(method = "cv", number = 15))
tree
## CART
##
## 915 samples
## 11 predictor
## 2 classes: 'Good', 'Poor'
##
## No pre-processing
## Resampling: Cross-Validated (15 fold)
## Summary of sample sizes: 855, 854, 854, 854, 854, 854, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.04784689 0.7082760 0.4111805
## 0.08612440 0.6798766 0.3608100
## 0.32296651 0.6263247 0.2339417
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.04784689.
rpart.plot(tree$finalModel)
This model considers alcohol and sulphates the most important variables: if a sample has more than 11% alcohol it is classified as Good, with only about a 20% chance of being Poor. This makes sense because better wines are usually stored longer in wooden barrels and can gain more alcohol. If it has less than 11% it is classified as Poor, unless it has more than 0.63 of sulphates, which are additives used in the process of transforming grapes into wine.
set.seed(237)
treepred = predict(tree, test, type = "raw")
treeB <- predict(tree, test, type = "prob")
confusionMatrix(factor(treepred), test$quality)$table
## Reference
## Prediction Good Poor
## Good 101 54
## Poor 23 50
Analyzing the confusion matrix, we have misclassified 54 wines as Good when they were Poor. Focusing on metrics derived from this confusion matrix:
conf_matrix = table(test$quality, treepred, dnn =c("Prediction", "Reference"))
accuracy = sum(diag(conf_matrix))/sum(conf_matrix)
precision = conf_matrix[1,1]/sum(conf_matrix[,1])
specificity = conf_matrix[2,2]/sum(conf_matrix[,2])
performance = data.frame(accuracy,precision,specificity)
performance
## accuracy precision specificity
## 1 0.6622807 0.6516129 0.6849315
We get an accuracy score of about 66%, not very far from a random classifier but not that bad either. Let’s see what happens when we use a very popular ML tool, Random Forest.
This algorithm takes the decision-tree idea to another level: it builds n decision trees, and Random Forest labels a sample with the mode of the individual tree predictions, i.e. the most-voted class. We prepare our code to run this algorithm.
set.seed(237)
rf.train <- randomForest(quality ~., data=train,ntree=400,mtry=2,importance=TRUE)
rf.pred <- predict(rf.train, newdata=test)
confusionMatrix(rf.pred, test$quality)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Good Poor
## Good 96 24
## Poor 28 80
##
## Accuracy : 0.7719
## 95% CI : (0.7119, 0.8247)
## No Information Rate : 0.5439
## P-Value [Acc > NIR] : 0.0000000000007383
##
## Kappa : 0.5417
##
## Mcnemar's Test P-Value : 0.6774
##
## Sensitivity : 0.7742
## Specificity : 0.7692
## Pos Pred Value : 0.8000
## Neg Pred Value : 0.7407
## Prevalence : 0.5439
## Detection Rate : 0.4211
## Detection Prevalence : 0.5263
## Balanced Accuracy : 0.7717
##
## 'Positive' Class : Good
##
We get an accuracy of 77.19%, much better than the previous classifier. Let’s take a deeper look at which variables matter most; remember that in the decision tree, alcohol and sulphates were the chosen ones.
r <- varImp(rf.train, scale = T)
r$Good <- NULL
names(r)<- c("Metrics")
r$Name <- rownames(r)
data.frame(r)
## Metrics Name
## fixed.acidity 11.760242 fixed.acidity
## volatile.acidity 19.647166 volatile.acidity
## citric.acid 13.638682 citric.acid
## residual.sugar 7.709138 residual.sugar
## chlorides 11.717972 chlorides
## free.sulfur.dioxide 9.777909 free.sulfur.dioxide
## total.sulfur.dioxide 19.692562 total.sulfur.dioxide
## density 14.019470 density
## pH 10.478079 pH
## sulphates 32.537533 sulphates
## alcohol 31.660481 alcohol
r$Name <- factor(r$Name , levels = r$Name [order(r$Metrics)])
ggplot(r) +
aes(x = Name, weight = Metrics) +
geom_bar(fill = "#228B22") +
labs(x = " ", y = " ", title = "Variable Importance",
caption = " ") +
coord_flip() +
theme_minimal() +
theme(plot.title = element_text(size = 17L, face = "bold",
hjust = 0.5))
Random Forest gives a lot of importance to the same variables the decision tree chose, and in addition also ranks total sulfur dioxide and volatile acidity highly.
conf_matrix = table(test$quality, rf.pred, dnn =c("Prediction", "Reference"))
accuracy = sum(diag(conf_matrix))/sum(conf_matrix)
precision = conf_matrix[1,1]/sum(conf_matrix[,1])
specificity = conf_matrix[2,2]/sum(conf_matrix[,2])
performance = data.frame(accuracy,precision,specificity)
performance
## accuracy precision specificity
## 1 0.7719298 0.8 0.7407407
To sum up, this random forest model is more expensive in computational complexity and time, although in this problem that is almost unnoticeable, and it achieves great results compared to the decision tree classifier.
Gradient Boosting builds a model iteratively: each new model is combined with the previous ones in a way that is expected to reduce the error.
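The idea can be sketched in a few lines of base R (a toy illustration under my own assumptions, using simple polynomial fits as weak learners rather than the trees gbm actually uses): each new learner is fit to the residuals of the current ensemble, and a shrunken fraction of its prediction is added in.

```r
# Toy boosting for squared error: each weak learner fits the current residuals
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)
pred <- rep(mean(y), length(y))           # start from the constant model
shrinkage <- 0.1                          # learning rate, as in gbm's shrinkage
mse <- numeric(50)
for (m in 1:50) {
  r <- y - pred                           # residuals of the current ensemble
  fit <- lm(r ~ poly(x, 3))               # weak learner fit to the residuals
  pred <- pred + shrinkage * fitted(fit)  # shrunken additive update
  mse[m] <- mean((y - pred)^2)
}
mse[1] > mse[50]                          # TRUE: training error keeps decreasing
```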
set.seed(237)
GBM.train <- gbm(ifelse(train$quality=="Poor",0,1) ~., data=train, distribution= "bernoulli",n.trees=500,shrinkage = 0.01,interaction.depth=2,n.minobsinnode = 8)
gbmProb = predict(GBM.train, newdata=test, n.trees=500,type="response")
gbmProb<-ifelse(gbmProb> 0.5,"Good","Poor")
confusionMatrix(factor(gbmProb), test$quality)$table
## Reference
## Prediction Good Poor
## Good 99 34
## Poor 25 70
conf_matrix = table(test$quality, gbmProb, dnn =c("Prediction", "Reference"))
accuracy = sum(diag(conf_matrix))/sum(conf_matrix)
precision = conf_matrix[1,1]/sum(conf_matrix[,1])
specificity = conf_matrix[2,2]/sum(conf_matrix[,2])
performance = data.frame(accuracy,precision,specificity)
performance
## accuracy precision specificity
## 1 0.7412281 0.7443609 0.7368421
In this case we’ve obtained slightly worse results than with RF, but very similar ones indeed.
XGBoost is a modification of the previous algorithm: it uses L1 & L2 regularization to get better results and reduce computing time. We will use cross-validation to see if we can improve on the current best classifier, the random forest.
ctrl <- trainControl(method = "cv", number = 10,
classProbs = TRUE,
verboseIter=T)
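The chunk that actually fits the XGBoost model never appears above, although xgb.train and xgb_imp are used below; a plausible reconstruction with caret is the following (the tuneLength value is my assumption, not necessarily the original setting):

```r
# Hedged reconstruction: fit an xgbTree model using the CV control defined above
set.seed(237)
xgb.train <- train(quality ~ ., data = train,
                   method = "xgbTree",   # caret wrapper around xgboost
                   trControl = ctrl,
                   tuneLength = 3)       # assumed small tuning grid
xgb_imp <- varImp(xgb.train)             # importance object plotted below
```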
After running the previous chunks we realize that cross-validation adds some computing time. If we look at the importance:
plot(xgb_imp, scales = list(y = list(cex = .95)))
We notice that once more alcohol is decisive when classifying a sample, along with sulphates and volatile acidity, much as in the Gradient Boosting model.
xgbProb = predict(xgb.train, newdata=test, type="raw")
confusionMatrix(xgbProb, test$quality)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Good Poor
## Good 99 39
## Poor 25 65
##
## Accuracy : 0.7193
## 95% CI : (0.6562, 0.7766)
## No Information Rate : 0.5439
## P-Value [Acc > NIR] : 0.00000004105
##
## Kappa : 0.428
##
## Mcnemar's Test P-Value : 0.1042
##
## Sensitivity : 0.7984
## Specificity : 0.6250
## Pos Pred Value : 0.7174
## Neg Pred Value : 0.7222
## Prevalence : 0.5439
## Detection Rate : 0.4342
## Detection Prevalence : 0.6053
## Balanced Accuracy : 0.7117
##
## 'Positive' Class : Good
##
We are getting a rather low value for specificity, and a lower accuracy than Gradient Boosting or Random Forest: this time about 72%.
Here we will train on subsamples of our data: we take subsets of the train set and use them with a RF, as it is the best classifier we have used so far.
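What the "up" sampling option is meant to do can be illustrated in base R (toy labels, independent of the wine data): the minority class is re-sampled with replacement until both classes appear equally often.

```r
# Up-sampling sketch: replicate minority-class rows until the classes balance
set.seed(237)
cls <- factor(c(rep("Good", 70), rep("Poor", 30)))   # imbalanced toy labels
counts <- table(cls)
minority <- names(which.min(counts))
extra <- sample(which(cls == minority),
                max(counts) - min(counts), replace = TRUE)
idx <- c(seq_along(cls), extra)   # original rows plus duplicated minority rows
table(cls[idx])                   # both classes now have 70 observations
```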
set.seed(237)
ctrl <- trainControl(method = "cv", number = 10,
classProbs = TRUE,
verboseIter=T,
sampling = "up")
ctrl$sampling <- "up"
set.seed(237)
rf.train <- randomForest(quality ~., data=train,ntree=400,mtry=2,importance=TRUE, trControl=ctrl)
rf.pred <- predict(rf.train, newdata=test)
confusionMatrix(rf.pred, test$quality)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Good Poor
## Good 96 24
## Poor 28 80
##
## Accuracy : 0.7719
## 95% CI : (0.7119, 0.8247)
## No Information Rate : 0.5439
## P-Value [Acc > NIR] : 0.0000000000007383
##
## Kappa : 0.5417
##
## Mcnemar's Test P-Value : 0.6774
##
## Sensitivity : 0.7742
## Specificity : 0.7692
## Pos Pred Value : 0.8000
## Neg Pred Value : 0.7407
## Prevalence : 0.5439
## Detection Rate : 0.4211
## Detection Prevalence : 0.5263
## Balanced Accuracy : 0.7717
##
## 'Positive' Class : Good
##
The results are identical to the original RF; note that randomForest() does not use caret’s trControl argument, so the up-sampling control has no effect here. Let’s modify the type of subsampling and see.
set.seed(237)
ctrl <- trainControl(method = "cv", number = 10,
classProbs = TRUE,
verboseIter=T)
ctrl$sampling <- "smote"
set.seed(237)
rf.train <- randomForest(quality ~., data=train,ntree=400,mtry=2,importance=TRUE, trControl=ctrl)
rf.pred <- predict(rf.train, newdata=test)
rfB <- predict(rf.train, newdata=test, type = "prob")
confusionMatrix(rf.pred, test$quality)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Good Poor
## Good 96 24
## Poor 28 80
##
## Accuracy : 0.7719
## 95% CI : (0.7119, 0.8247)
## No Information Rate : 0.5439
## P-Value [Acc > NIR] : 0.0000000000007383
##
## Kappa : 0.5417
##
## Mcnemar's Test P-Value : 0.6774
##
## Sensitivity : 0.7742
## Specificity : 0.7692
## Pos Pred Value : 0.8000
## Neg Pred Value : 0.7407
## Prevalence : 0.5439
## Detection Rate : 0.4211
## Detection Prevalence : 0.5263
## Balanced Accuracy : 0.7717
##
## 'Positive' Class : Good
##
Again the performance metrics are exactly the same as before, for the same reason: the sampling option inside trControl is only honored by caret::train, not by randomForest() directly. Used through caret, subsampling can improve accuracy, sensitivity and specificity, and it usually helps reduce the computational cost on very large datasets.
r <- varImp(rf.train, scale = T)
r$Good <- NULL
names(r)<- c("Metrics")
r$Name <- rownames(r)
data.frame(r)
## Metrics Name
## fixed.acidity 11.760242 fixed.acidity
## volatile.acidity 19.647166 volatile.acidity
## citric.acid 13.638682 citric.acid
## residual.sugar 7.709138 residual.sugar
## chlorides 11.717972 chlorides
## free.sulfur.dioxide 9.777909 free.sulfur.dioxide
## total.sulfur.dioxide 19.692562 total.sulfur.dioxide
## density 14.019470 density
## pH 10.478079 pH
## sulphates 32.537533 sulphates
## alcohol 31.660481 alcohol
r$Name <- factor(r$Name , levels = r$Name [order(r$Metrics)])
ggplot(r) +
aes(x = Name, weight = Metrics) +
geom_bar(fill = "#228B22") +
labs(x = "", y = " ", title = "Variable Importance",
caption = " ") +
coord_flip() +
theme_minimal() +
theme(plot.title = element_text(size = 17L, face = "bold",
hjust = 0.5))
We can see that the importance for our main variables hasn’t changed at all.
We now want to combine the predictions of several base estimators; let’s see if we can improve on RF’s results.
mode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
ensemble.pred = apply(data.frame(rfB, treeB ), 1, mode)
plot.roc(test$quality, ensemble.pred,col="darkblue", print.auc = TRUE, auc.polygon=TRUE, grid=c(0.1, 0.2),
grid.col=c("green", "red"), max.auc.polygon=TRUE,
auc.polygon.col="lightblue", print.thres=TRUE)
## Setting levels: control = Good, case = Poor
## Setting direction: controls > cases
We must state that AUC and accuracy are not the same thing, though they have a similar flavor: the closer the AUC gets to 1, the better our model, while a value close to 0.5 means the classifier behaves like a random one. Here we’ve combined RF and the decision tree and obtain an AUC of almost 0.85, so this ensemble is very good.
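AUC also has a concrete probabilistic reading: it equals the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one. A small base-R sketch with made-up scores (not the model above):

```r
# AUC as P(score of random positive > score of random negative); ties count half
pos <- c(0.9, 0.8, 0.7, 0.4)           # toy scores for the positive class
neg <- c(0.6, 0.3, 0.2)                # toy scores for the negative class
pairs <- expand.grid(p = pos, n = neg) # all positive/negative pairs
auc <- mean(pairs$p > pairs$n) + 0.5 * mean(pairs$p == pairs$n)
auc   # 11 of the 12 pairs are ordered correctly, so AUC = 11/12
```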
Discriminant analysis methods aim to minimize the probability of wrongly classifying a sample; they are often used to gain insight and for feature selection.
LDA projects the features, which usually live in higher-dimensional spaces (11 dimensions in our case), onto a lower-dimensional space in order to reduce resource and dimensionality costs; it also assumes that the covariance matrix is shared across classes.
lda.model <- lda(quality ~ ., data=train)
lda.model
## Call:
## lda(quality ~ ., data = train)
##
## Prior probabilities of groups:
## Good Poor
## 0.5431694 0.4568306
##
## Group means:
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## Good 8.512274 0.4697686 0.3025956 2.500000 0.08251509
## Poor 8.140909 0.5978828 0.2306220 2.513995 0.08933014
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## Good 14.93461 39.07847 0.9964496 3.308632 0.6958753
## Poor 16.26316 54.31938 0.9970373 3.312847 0.6026077
## alcohol
## Good 10.880952
## Poor 9.938397
##
## Coefficients of linear discriminants:
## LD1
## fixed.acidity -0.152123897
## volatile.acidity 2.667746755
## citric.acid 1.073837157
## residual.sugar -0.004992043
## chlorides 1.026237568
## free.sulfur.dioxide -0.020400565
## total.sulfur.dioxide 0.012741878
## density 42.780154258
## pH -0.054063521
## sulphates -2.580333476
## alcohol -0.667204041
probability = predict(lda.model, newdata=test)
head(probability$posterior)
## Good Poor
## 5 0.1914967 0.8085033
## 6 0.2073707 0.7926293
## 14 0.6015943 0.3984057
## 15 0.3883100 0.6116900
## 22 0.4075050 0.5924950
## 28 0.6330191 0.3669809
confusionMatrix(probability$class, test$quality)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Good Poor
## Good 91 30
## Poor 33 74
##
## Accuracy : 0.7237
## 95% CI : (0.6608, 0.7807)
## No Information Rate : 0.5439
## P-Value [Acc > NIR] : 0.00000001866
##
## Kappa : 0.4444
##
## Mcnemar's Test P-Value : 0.8011
##
## Sensitivity : 0.7339
## Specificity : 0.7115
## Pos Pred Value : 0.7521
## Neg Pred Value : 0.6916
## Prevalence : 0.5439
## Detection Rate : 0.3991
## Detection Prevalence : 0.5307
## Balanced Accuracy : 0.7227
##
## 'Positive' Class : Good
##
Using our LDA model on the test set we get 72% accuracy, which again is pretty good and cheap for our machines!
QDA, unlike LDA, allows each class its own covariance matrix: we no longer assume a shared covariance here.
qda.model <- qda(quality ~ ., data=train)
qda.model
## Call:
## qda(quality ~ ., data = train)
##
## Prior probabilities of groups:
## Good Poor
## 0.5431694 0.4568306
##
## Group means:
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## Good 8.512274 0.4697686 0.3025956 2.500000 0.08251509
## Poor 8.140909 0.5978828 0.2306220 2.513995 0.08933014
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## Good 14.93461 39.07847 0.9964496 3.308632 0.6958753
## Poor 16.26316 54.31938 0.9970373 3.312847 0.6026077
## alcohol
## Good 10.880952
## Poor 9.938397
prediction = predict(qda.model, newdata=test)$class
confusionMatrix(prediction, test$quality)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Good Poor
## Good 89 32
## Poor 35 72
##
## Accuracy : 0.7061
## 95% CI : (0.6424, 0.7644)
## No Information Rate : 0.5439
## P-Value [Acc > NIR] : 0.0000003856
##
## Kappa : 0.4091
##
## Mcnemar's Test P-Value : 0.807
##
## Sensitivity : 0.7177
## Specificity : 0.6923
## Pos Pred Value : 0.7355
## Neg Pred Value : 0.6729
## Prevalence : 0.5439
## Detection Rate : 0.3904
## Detection Prevalence : 0.5307
## Balanced Accuracy : 0.7050
##
## 'Positive' Class : Good
##
We get a 70% accuracy score, still quite good but a little worse than LDA.
## Conclusions on Classifiers
We have used some of the most powerful current classifiers; almost all of them have shown performance around 70%, almost reaching 85%. Adding a subsampling technique or cross-validation can add computing time, depending on the tool, but tends to give better results. Our best model has been the Random Forest. Our ensemble method reached an AUC of 0.85, which is great, but with lower specificity, around 65%; that’s why RF is our best model.
Jumping into regression models: since many variables have very little correlation, I believe regression models aren’t going to work well, but I’ll try to get the best possible ones.
We encoded Good and Poor as 1’s and 0’s to try to improve the results, but keeping the original numeric quality values may help too; let’s see.
set.seed(237)
spl3 = createDataPartition(df$quality, p = 0.8, list = FALSE)
train3 = df[spl3,]
test3 = df[-spl3,]
To start with, I’ll use the variable most classifiers deemed least important: pH.
linFit <- lm(quality ~ pH, data=train2)
summary(linFit)
##
## Call:
## lm(formula = quality ~ pH, data = train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5944 -0.5391 0.4382 0.4600 0.4999
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.24093 0.35088 0.687 0.492
## pH 0.09063 0.10587 0.856 0.392
##
## Residual standard error: 0.4987 on 913 degrees of freedom
## Multiple R-squared: 0.0008021, Adjusted R-squared: -0.0002923
## F-statistic: 0.7329 on 1 and 913 DF, p-value: 0.3922
par(mfrow=c(2,2))
plot(linFit, pch=23 ,bg='orange',cex=2)
We are going to run a correlation test between our predictions and the real values; R’s cor() function computes the Pearson correlation by default. The closer the correlation is to 1, the better.
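Note that cor() only computes Spearman (rank) correlation when asked via its method argument. A quick contrast on a monotone but nonlinear toy relation:

```r
x <- 1:20
y <- exp(x)                        # monotone but strongly nonlinear in x
cor(x, y)                          # Pearson: noticeably below 1
cor(x, y, method = "spearman")     # Spearman: 1, since the ranks agree exactly
```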
pr.simple = predict(linFit, newdata=test2)
cor(test2$quality, pr.simple)^2
## [1] 0.001302806
In our case the value is insignificant, so quality and pH are essentially uncorrelated. If we use our new train and test sets we obtain:
linFit <- lm(quality ~ pH, data=train3)
summary(linFit)
##
## Call:
## lm(formula = quality ~ pH, data = train3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6625 -0.6671 0.2605 0.3737 2.5230
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.1607 0.5621 12.738 < 0.0000000000000002 ***
## pH -0.4526 0.1696 -2.669 0.00775 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8039 on 913 degrees of freedom
## Multiple R-squared: 0.007739, Adjusted R-squared: 0.006652
## F-statistic: 7.121 on 1 and 913 DF, p-value: 0.007754
par(mfrow=c(2,2))
plot(linFit, pch=23 ,bg='orange',cex=2)
pr.simple = predict(linFit, newdata=test3)
cor(test3$quality, pr.simple)^2
## [1] 0.0083844
The diagnostic plots show that the model is not as good as it should be. The squared correlation rises to 0.008, still almost 0. This suggests that the numeric output works slightly better than the Good/Poor encoding, and that these two variables are essentially uncorrelated. Let’s see what happens when we use the most important variable.
q_s <- lm(quality ~ sulphates, data=train2)
summary(q_s)
##
## Call:
## lm(formula = quality ~ sulphates, data = train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3220 -0.4873 0.3201 0.4660 0.6178
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.15456 0.06221 2.484 0.0132 *
## sulphates 0.58372 0.09076 6.431 0.000000000204 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4879 on 913 degrees of freedom
## Multiple R-squared: 0.04334, Adjusted R-squared: 0.04229
## F-statistic: 41.36 on 1 and 913 DF, p-value: 0.0000000002038
par(mfrow=c(2,2))
plot(q_s, pch=23 ,bg='orange',cex=2)
The diagnostic plots don’t show any clear visual relation. Evaluating on the test set with the Good/Poor (0/1) encoding we get:
pr.simple = predict(q_s, newdata=test2)
cor(test2$quality, pr.simple)^2
## [1] 0.1474362
This is the best score so far, but still very poor. If we use the numeric codification:
q_s <- lm(quality ~ sulphates, data=train3)
summary(q_s)
##
## Call:
## lm(formula = quality ~ sulphates, data = train3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6268 -0.5353 0.0681 0.4342 2.3732
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.6658 0.1049 44.459 <0.0000000000000002 ***
## sulphates 1.5255 0.1559 9.786 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7678 on 913 degrees of freedom
## Multiple R-squared: 0.09494, Adjusted R-squared: 0.09395
## F-statistic: 95.78 on 1 and 913 DF, p-value: < 0.00000000000000022
par(mfrow=c(2,2))
plot(q_s, pch=23 ,bg='orange',cex=2)
pr.simple = predict(q_s, newdata=test3)
cor(test3$quality, pr.simple)^2
## [1] 0.00952285
On the test set the squared correlation is about 0.0095, again terribly poor. I will nevertheless keep using the numeric encoding and partitions from here on.
Trying to predict the output from a single variable is pretty useless; let’s add some more and see what happens.
linFit <- lm(quality ~ alcohol + sulphates + volatile.acidity + total.sulfur.dioxide + density, data=train2)
summary(linFit)
##
## Call:
## lm(formula = quality ~ alcohol + sulphates + volatile.acidity +
## total.sulfur.dioxide + density, data = train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.00822 -0.34457 -0.02947 0.35730 0.99256
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.026499 8.481796 -0.711 0.478
## alcohol 0.180847 0.015377 11.761 < 0.0000000000000002 ***
## sulphates 0.355224 0.083761 4.241 0.000024536847 ***
## volatile.acidity -0.539778 0.083533 -6.462 0.000000000168 ***
## total.sulfur.dioxide -0.001960 0.000427 -4.591 0.000005028002 ***
## density 4.838487 8.436592 0.574 0.566
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4216 on 909 degrees of freedom
## Multiple R-squared: 0.2891, Adjusted R-squared: 0.2852
## F-statistic: 73.92 on 5 and 909 DF, p-value: < 0.00000000000000022
pr.multiple = exp(predict(linFit, newdata=test2))
cor(test2$quality, pr.multiple)^2
## [1] 0.2499848
We now get a 0.25 score; compared to the previous results this is almost excellent, although as a squared Pearson correlation it is still poor. Also notice the R^2: a value of 1 would mean a perfect fit, and ours is well below that, which means our regression isn’t working very well.
linFit <- lm(quality ~ alcohol + sulphates + volatile.acidity + total.sulfur.dioxide + density, data=train3)
summary(linFit)
##
## Call:
## lm(formula = quality ~ alcohol + sulphates + volatile.acidity +
## total.sulfur.dioxide + density, data = train3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.28595 -0.38208 -0.07113 0.42277 2.04302
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.8762683 13.1391597 -0.295 0.768049
## alcohol 0.2866089 0.0238527 12.016 < 0.0000000000000002 ***
## sulphates 0.8598139 0.1386065 6.203 0.000000000839 ***
## volatile.acidity -1.3049343 0.1229491 -10.614 < 0.0000000000000002 ***
## total.sulfur.dioxide -0.0023116 0.0006474 -3.570 0.000375 ***
## density 6.7998406 13.0613980 0.521 0.602768
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6392 on 909 degrees of freedom
## Multiple R-squared: 0.3754, Adjusted R-squared: 0.372
## F-statistic: 109.3 on 5 and 909 DF, p-value: < 0.00000000000000022
pr.multiple = exp(predict(linFit, newdata=test3))
cor(test3$quality, pr.multiple)^2
## [1] 0.280129
On the same task our new codification again shows a slight improvement, so we will keep train3 & test3.
linFit <- lm(quality ~ alcohol + sulphates + volatile.acidity + total.sulfur.dioxide + density + citric.acid + fixed.acidity + chlorides + free.sulfur.dioxide + pH + residual.sugar, data=train3)
summary(linFit)
##
## Call:
## lm(formula = quality ~ alcohol + sulphates + volatile.acidity +
## total.sulfur.dioxide + density + citric.acid + fixed.acidity +
## chlorides + free.sulfur.dioxide + pH + residual.sugar, data = train3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.56271 -0.35414 -0.06343 0.42627 1.87738
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.7348072 27.2822529 1.163 0.245054
## alcohol 0.2655070 0.0347425 7.642 0.0000000000000545 ***
## sulphates 0.9258061 0.1500559 6.170 0.0000000010313528 ***
## volatile.acidity -1.1726953 0.1523609 -7.697 0.0000000000000365 ***
## total.sulfur.dioxide -0.0034247 0.0009252 -3.701 0.000227 ***
## density -27.9362144 27.8264886 -1.004 0.315675
## citric.acid -0.1023475 0.1903100 -0.538 0.590852
## fixed.acidity 0.0378093 0.0333770 1.133 0.257601
## chlorides -0.8818582 0.5721927 -1.541 0.123621
## free.sulfur.dioxide 0.0057317 0.0029370 1.952 0.051302 .
## pH -0.3520991 0.2459610 -1.432 0.152626
## residual.sugar 0.0135773 0.0215706 0.629 0.529224
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6353 on 903 degrees of freedom
## Multiple R-squared: 0.3872, Adjusted R-squared: 0.3797
## F-statistic: 51.86 on 11 and 903 DF, p-value: < 0.00000000000000022
pr.multiple = exp(predict(linFit, newdata=test3))
pr.multiple <-round(pr.multiple,0)
cor(test3$quality, pr.multiple)^2
## [1] 0.2890134
We keep getting a slightly better squared Pearson correlation, but it is still low. Notice that in all the previous regression models, the importance given to the variables is quite similar to what the classification tools reported. Now we will use all the variables in the dataset to build a model.
model = quality ~ alcohol + sulphates + volatile.acidity + total.sulfur.dioxide + density + citric.acid + fixed.acidity + chlorides + free.sulfur.dioxide + pH + residual.sugar
linFit <- lm(model, data=train3)
ols <- ols_step_all_possible(linFit)
This shows every possible combination of predictors together with its R-squared and other metrics.
head(ols)
## Index N Predictors R-Square Adj. R-Square Mallow's Cp
## 1 1 1 alcohol 0.22010025 0.21924603 238.1657
## 3 2 1 volatile.acidity 0.18564479 0.18475283 288.9351
## 2 3 1 sulphates 0.09494239 0.09395109 422.5832
## 6 4 1 citric.acid 0.07607171 0.07505974 450.3887
## 4 5 1 total.sulfur.dioxide 0.03369279 0.03263441 512.8332
## 5 6 1 density 0.02603490 0.02496812 524.1169
If we now plot the results, we can see that R-squared and adjusted R-squared grow as more predictors are added.
plot(ols_step_forward_p(linFit))
summary(linFit)
##
## Call:
## lm(formula = model, data = train3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.56271 -0.35414 -0.06343 0.42627 1.87738
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.7348072 27.2822529 1.163 0.245054
## alcohol 0.2655070 0.0347425 7.642 0.0000000000000545 ***
## sulphates 0.9258061 0.1500559 6.170 0.0000000010313528 ***
## volatile.acidity -1.1726953 0.1523609 -7.697 0.0000000000000365 ***
## total.sulfur.dioxide -0.0034247 0.0009252 -3.701 0.000227 ***
## density -27.9362144 27.8264886 -1.004 0.315675
## citric.acid -0.1023475 0.1903100 -0.538 0.590852
## fixed.acidity 0.0378093 0.0333770 1.133 0.257601
## chlorides -0.8818582 0.5721927 -1.541 0.123621
## free.sulfur.dioxide 0.0057317 0.0029370 1.952 0.051302 .
## pH -0.3520991 0.2459610 -1.432 0.152626
## residual.sugar 0.0135773 0.0215706 0.629 0.529224
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6353 on 903 degrees of freedom
## Multiple R-squared: 0.3872, Adjusted R-squared: 0.3797
## F-statistic: 51.86 on 11 and 903 DF, p-value: < 0.00000000000000022
The same variables remain the most important ones.
predictions <- predict(linFit, newdata=test3)
cor(test3$quality, predictions)^2
## [1] 0.3023072
Luckily for us, the squared Pearson correlation has risen to 0.30. Let's try some cross-validation.
ctrl <- trainControl(method = "repeatedcv",
number = 5, repeats = 1)
lm_tune <- train(model, data = train3,
method = "lm",
preProc=c('scale', 'center'),
trControl = ctrl)
p <- predict(lm_tune, test3)
postResample(pred = p, obs = test3$quality)
## RMSE Rsquared MAE
## 0.6720243 0.3023072 0.5263898
These results are on par with the previous ones. The closer the RMSE is to 0 the better; ours, at about 0.67, sits closer to 1, which isn't great.
qplot(p, test3$quality) +
labs(title="Backward Regression Observed VS Predicted", x="Predicted", y="Observed") +
geom_abline(intercept = 0, slope = 1, colour = "blue") +
theme_bw()
We can see that the model has predicted very few samples well. Also notice that the predictions are floating-point values while our target is an integer, so let's round them and see how the model behaves.
q <- predict(lm_tune, test3)
q <- round(q,0)
postResample(pred = q, obs = test3$quality)
## RMSE Rsquared MAE
## 0.7254763 0.2438762 0.4473684
Our RMSE is worse, so the results are not better than the previous ones.
qplot(q, test3$quality) +
labs(title="Backward Regression Observed VS Predicted", x="Predicted", y="Observed") +
geom_abline(intercept = 0, slope = 1, colour = "blue") +
theme_bw()
Only 3 wines have been predicted exactly, which is very poor.
In short, stepwise regression fits a model on multiple variables while simultaneously removing those that aren't important; in our case we more or less already know which ones matter.
set.seed(237)
step_tune <- train(model, data = train3,
method = "leapSeq",
preProc=c('scale', 'center'),
tuneGrid = expand.grid(nvmax = 4:10),
trControl = ctrl)
plot(step_tune)
Our ideal number of predictors is 8. Note that seeding the generator with 237 keeps the results stable; had we not fixed the seed, this plot would look different each time we ran it.
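The role of the seed is easy to demonstrate in isolation (a standalone sketch, unrelated to the tuning above):

```r
# Fixing the seed makes "random" draws reproducible across runs
set.seed(237)
a <- rnorm(3)
set.seed(237)
b <- rnorm(3)
identical(a, b)   # TRUE: same seed, identical draws
```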
coef(step_tune$finalModel, step_tune$bestTune$nvmax)
## (Intercept) alcohol sulphates
## 5.662295082 0.310927248 0.144790092
## volatile.acidity total.sulfur.dioxide fixed.acidity
## -0.210431019 -0.118812716 0.009800311
## chlorides free.sulfur.dioxide pH
## -0.043413323 0.060802594 -0.076305216
Again, alcohol and sulphates are the most relevant variables, as in the previous models.
r2 <- predict(step_tune, test3)
postResample(pred = r2, obs = test3$quality)
## RMSE Rsquared MAE
## 0.6715909 0.3033776 0.5249478
But with these regression techniques we still get very poor results.
Ridge regression is used when variables are highly correlated with each other. That is not our case, but let's see how it behaves when we use it in the worst possible scenario: very poor correlation among many predictors.
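Before fitting it with caret, the mechanics can be sketched with the ridge closed form, beta = (X'X + lambda*I)^(-1) X'y, on a made-up, nearly collinear design matrix (the numbers below are illustrative, not wine data):

```r
# Two nearly collinear columns; y equals the first column exactly
X <- cbind(x1 = c(1, 2, 3, 4), x2 = c(1.1, 1.9, 3.2, 3.9))
y <- c(1, 2, 3, 4)
ridge_beta <- function(X, y, lambda)
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
ridge_beta(X, y, 0)   # OLS: all the weight lands on x1 -> (1, 0)
ridge_beta(X, y, 1)   # ridge spreads the weight across the pair and shrinks it
```

With lambda = 1 the coefficients come out at roughly (0.51, 0.48): the penalty splits the weight between the correlated pair and shrinks the total slightly, which is exactly why ridge stabilizes collinear problems.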
model = quality ~ alcohol + sulphates + volatile.acidity + total.sulfur.dioxide + pH
ridge_grid <- expand.grid(lambda = seq(0, .1, length = 100))
ridge_tune <- train(model, data = train3,
method='ridge',
preProc=c('scale','center'),
tuneGrid = ridge_grid,
trControl=ctrl)
plot(ridge_tune)
Even if we use the best tuning parameter, it isn't going to make an important difference.
ridge_tune$bestTune
## lambda
## 1 0
r <- predict(ridge_tune, test3)
postResample(pred = r, obs = test3$quality)
## RMSE Rsquared MAE
## 0.6775673 0.2917990 0.5237511
As we expected, ridge does not work well on our dataset; it makes our already poor regression results even worse.
Lasso performs variable selection and regularization in order to get better predictions. Is it going to improve our results?
lasso_grid <- expand.grid(fraction = seq(.01, 1, length = 100))
lasso_tune <- train(model, data = train3,
method='lasso',
preProc=c('scale','center'),
tuneGrid = lasso_grid,
trControl=ctrl)
plot(lasso_tune)
lasso_tune$bestTune
## fraction
## 99 0.99
r <- predict(lasso_tune, test3)
postResample(pred = r, obs = test3$quality)
## RMSE Rsquared MAE
## 0.6770365 0.2922971 0.5237219
We still get poor results.
Elastic net combines the L1 (lasso) and L2 (ridge) penalties and helps a lot when we have highly correlated or simply useless variables. I assume it won't work well here either.
modelLookup('glmnet')
## model parameter label forReg forClass probModel
## 1 glmnet alpha Mixing Percentage TRUE TRUE TRUE
## 2 glmnet lambda Regularization Parameter TRUE TRUE TRUE
elastic_grid = expand.grid(alpha = seq(0, .2, 0.01), lambda = seq(0, .1, 0.01))
glmnet_tune <- train(model, data = train3,
method='glmnet',
preProc=c('scale','center'),
tuneGrid = elastic_grid,
trControl=ctrl)
plot(glmnet_tune)
glmnet_tune$bestTune
## alpha lambda
## 14 0.01 0.02
r <- predict(glmnet_tune, test3)
postResample(pred = r, obs = test3$quality)
## RMSE Rsquared MAE
## 0.6772749 0.2911153 0.5243533
Again the regularization parameters come out very low; the model just doesn't help, mainly because our variables show little or no correlation.
For this particular case, and as we suspected at the very beginning of the homework, wine quality prediction is not a linear regression problem but a classification one. A random predictor with a 0.5 chance of labelling each wine as good or bad would achieve better results. It is true that we are predicting a number instead of a label, but the most curious finding is that rounding the predictions made them worse, which actually makes a lot of sense: if the real value is 5 and our prediction is 5.5, our RMSE is lower than if we round the prediction up to 6.
MAE and RMSE sit around 0.5 to 0.7 rather than near 0, which tells us that our regression isn't working well. RMSE is more sensitive to outliers than MAE, and we have outliers in almost every variable that we unfortunately could not remove. The fact that RMSE is noticeably higher than MAE indicates that our errors have a large variance.
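The rounding effect is easy to reproduce by hand; a minimal sketch with made-up observations and predictions (not our actual test set):

```r
# Toy illustration: rounding predictions can raise RMSE while lowering MAE
obs  <- c(5, 5, 6, 6, 7)
pred <- c(5.5, 4.6, 6.4, 5.5, 6.6)          # raw model output
rmse <- function(o, p) sqrt(mean((o - p)^2))
mae  <- function(o, p) mean(abs(o - p))
rmse(obs, pred)          # about 0.44
rmse(obs, round(pred))   # about 0.45: slightly worse
mae(obs, pred)           # 0.44
mae(obs, round(pred))    # 0.20: better, since exact hits cost nothing
```

This mirrors what we observed on the real test set: rounding improved the MAE but worsened the RMSE, because the occasional full-point misses get squared.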
It is widely held that Random Forest is one of the best ML methods available today, and this project has borne that out; moreover, cross-validation and subsampling further improve future predictions. This is not an easy dataset to analyze because of the wide spread of values within each variable: a good wine can have the same acidity as a terrible one, which makes things difficult for classifiers and nearly hopeless for regression, since the variables have very little or no correlation with quality.
Random Forest has outperformed all the other classifiers and regression techniques, with 77% accuracy. The ensemble methods reached an AUC of almost 85%, a bit better than our RF but more expensive, since we have to train, validate and test more models; that is acceptable for a dataset of this size. We used only RF and the decision tree in the ensemble and it worked quite well. Personally, I would choose Random Forest, as it keeps all its performance metrics around 77%, which is even better overall.
We can see that the wine quality classification problem is not linear, but all the predictors lead to the following conclusions: 1· alcohol and sulphates are decisive in determining whether a wine is good or poor; 2· pH, sugar and the residues of other elements are not that important; 3· no regression can be fitted properly to the data.
ggplot(df) +
aes(x = alcohol) +
geom_histogram(bins = 30L, fill = "#0C4C8A") +
labs(title = "Quality vs Alcohol") +
theme_minimal() +
facet_wrap(vars(quality))
Notice that wines with lower quality tend to have less alcohol, as they usually age less than good wines. Good wines vary in alcohol content and are normally higher, but they sometimes show low alcohol values, which is why the specificity or the sensitivity was occasionally lower; the model also gave a lot of importance to volatile acidity, another highly important feature.
ggplot(df) +
aes(x = volatile.acidity) +
geom_histogram(bins = 30L, fill = "#112446") +
labs(title = "Volatile Acidity") +
theme_minimal() +
facet_wrap(vars(quality))
As quality increases, volatile acidity decreases: more acidic wines are worse.
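This reading can also be checked numerically by averaging volatile.acidity per quality level with aggregate(); a sketch on a made-up mini data frame (the same call applies to our df):

```r
# toy stands in for df; a decreasing mean acidity across quality
# levels supports the "more acidic, worse wine" reading
toy <- data.frame(
  quality          = c(4, 4, 6, 6, 8, 8),
  volatile.acidity = c(0.90, 0.80, 0.60, 0.50, 0.40, 0.30)
)
means <- aggregate(volatile.acidity ~ quality, data = toy, FUN = mean)
means   # quality levels 4, 6, 8 with means 0.85, 0.55, 0.35
```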